
American Journal of Epidemiology

Oxford University Press (OUP)

Preprints posted in the last 90 days, ranked by how well they match American Journal of Epidemiology's content profile, based on 57 papers previously published here. The average preprint has a 0.05% match score for this journal, so anything above that is already an above-average fit.

1
HHBayes: A Flexible Bayesian Framework for Simulating and Analyzing Household Transmission Dynamics

Li, K.; Hou, Y.; Mukherjee, B.; Pitzer, V. E.; Weinberger, D. M.

2026-04-03 infectious diseases 10.64898/2026.04.01.26349903 medRxiv
Top 0.1%
28.9%

Household transmission studies are important for understanding infectious disease transmission and evaluating interventions; however, they are frequently constrained by methodological challenges, including in study design and sample size determination, and in estimating parameters of interest after collecting the data. Existing tools often lack flexibility in modeling age-specific susceptibility, infectivity patterns, and the impact of interventions such as vaccination or prophylaxis. Here, we develop HHBayes, an open-source R package that provides a unified framework for simulating and analyzing household transmission data using Bayesian methods. The package enables researchers to: (1) simulate realistic household transmission dynamics with highly customizable variables; (2) incorporate viral load data (measured in viral copies/mL or cycle threshold values) to model time-varying infectiousness; (3) estimate age-dependent susceptibility and infectivity parameters using Hamiltonian Monte Carlo methods implemented in Stan; and (4) evaluate intervention effects through user-defined covariates that modify susceptibility or infectivity. We demonstrate the capabilities of the package through simulation studies showing accurate parameter recovery and applications to seasonal respiratory virus transmission, including the impact of vaccination and antiviral prophylaxis on household attack rates. HHBayes addresses a critical gap in infectious disease epidemiology by providing researchers with accessible tools for both prospective study design and retrospective data analysis. The flexibility of the package in handling complex household structures, time-varying infectiousness, and intervention effects makes it valuable for studying diverse pathogens.

2
Quantifying bias from reverse causation in observational studies of dementia risk factors: A simulation study informed by age-specific reverse Mendelian Randomization

Wang, J.; Ackley, S.; Chen, R.; Kezios, K.; Zeki Al Hazzouri, A.; Blacker, D.; Torres, J. M.; Glymour, M. M.

2026-02-23 epidemiology 10.64898/2026.02.21.26346807 medRxiv
Top 0.1%
26.7%

Background: The long preclinical phase of dementia can bias estimated effects of baseline exposures on dementia incidence. We demonstrate simulations informed by reverse Mendelian randomization (MR) findings to quantify the age-specific magnitude of reverse causation bias in observational studies of the effects of body mass index (BMI) on dementia. Methods: We simulated longitudinal trajectories of BMI and dementia risk from ages 45 to 90 years, calibrating to published evidence on age-specific dementia incidence, BMI, and associations of dementia genetic risk with BMI. Under the null that BMI does not influence dementia and an alternative that BMI at any age increases subsequent dementia risk, we simulated hypothetical cohort studies (n=20,000, average 15 years of follow-up), varying age of entry from 45 to 80 years. In each hypothetical cohort, the association of z-standardized BMI at study entry with dementia incidence was estimated using Cox proportional hazards models. Bias was quantified using the ratio of observed to true hazard ratios (RHRs). All scenarios were replicated 500 times. Results: In the absence of a causal effect of BMI on dementia, when follow-up began at age 65 years, the RHR was 0.91 (95% CI: 0.90-0.92). When follow-up began at age 80 years, the RHR decreased to 0.68 (95% CI: 0.67-0.69), indicating substantial bias attributable to reverse causation. Conclusion: Reverse causation, presumably arising from preclinical dementia, can induce substantial bias in estimates of the association between baseline exposures and dementia incidence. Simulations provide a convenient tool to quantify this bias.

3
Incorporating Uncertainty in Study Participants' Age in Serocatalytic Models

Chen, J.; Lambe, T.; Kamau, E.; Donnelly, C.; Lambert, B.; Bajaj, S.

2026-03-16 infectious diseases 10.64898/2026.03.14.26346885 medRxiv
Top 0.1%
23.1%

Serological surveys measure the presence of antibodies in a population to infer past exposure to an infectious pathogen. If study participants' ages are known, serocatalytic models can be used to retrace the historical transmission strength of a pathogen within that population, quantified by the force of infection (FOI). These models rely on age information as a key variable since infection risks are interpreted in relation to how long individuals have been at risk. However, due to data constraints, participants' ages may be provided only within "age bins". A common approach is then to assign individuals' ages to be the midpoints of their respective age bins, ignoring uncertainty in this quantity. In this study, we quantify the bias introduced by this midpoint approach and develop a Bayesian framework that explicitly accounts for uncertainty in age. By comparing inference under constant, age-dependent, and time-dependent FOI scenarios, we show that incorporating uncertainty in age in serocatalytic models yields more reliable FOI estimates without sacrificing computational complexity. These improvements support the interpretation of serological data and inform public health decisions, such as estimating disease burden and identifying targeted vaccination groups.
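
To see why the midpoint approach can mislead, the sketch below compares two expected seroprevalences for one age bin under the simplest serocatalytic model (constant FOI, no seroreversion), where the probability of being seropositive at age a is 1 - exp(-FOI * a). It is an illustration only, not the preprint's Bayesian framework: the uniform age distribution within the bin, the FOI value, and the bin limits are all assumptions chosen for the example.

```python
import math


def seroprevalence(age, foi):
    """Constant-FOI serocatalytic model without seroreversion."""
    return 1.0 - math.exp(-foi * age)


def bin_averaged_seroprevalence(lower, upper, foi):
    """Average of seroprevalence(age) for ages uniform on [lower, upper].

    Uses the closed-form integral of exp(-foi * age) over the bin.
    """
    integral = (math.exp(-foi * lower) - math.exp(-foi * upper)) / foi
    return 1.0 - integral / (upper - lower)


# Illustrative (assumed) values: FOI of 0.1 per year, age bin 0-20 years.
foi = 0.1
lower, upper = 0.0, 20.0

midpoint_value = seroprevalence((lower + upper) / 2.0, foi)
averaged_value = bin_averaged_seroprevalence(lower, upper, foi)

print(f"midpoint approach:    {midpoint_value:.3f}")
print(f"averaged over the bin: {averaged_value:.3f}")
```

Because the seroprevalence curve is concave in age, evaluating it at the bin midpoint overstates the bin-averaged value, so the two approaches imply different FOI values for the same observed data.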

4
Physical activity and body mass index inequities among adult women in the United States: An application of intersectional multilevel analysis of individual heterogeneity and discriminatory accuracy (I-MAIHDA)

Echeverria, S.; Seo, Y.; Borrell, L. N.; McKelvey, D.; Najjar, T.; Reifsteck, E. J.; Erausquin, J. T.; Maher, J. P.

2026-04-07 epidemiology 10.64898/2026.04.06.26350273 medRxiv
Top 0.1%
22.4%

Background Physical activity (PA) and body mass index (BMI) shape cardiovascular risk, particularly in women. Yet, little research exists examining intersectional social axes shaping PA and BMI inequities among women living in the United States (US). Methods Data included women sampled in the 2015-2020 National Health and Nutrition Examination Survey. We used Intersectional Multilevel Analysis of Individual Heterogeneity and Discriminatory Accuracy (I-MAIHDA) via linear models to examine PA (n=4,591) and BMI (n=4,596) inequities across intersectional strata defined by race/ethnicity, age, education, nativity, and work status. We further quantified the contribution of these strata to the observed inequities and estimated additive fixed effects. Results In the null model, intersectional strata explained 4.6% and 13.8% of the variance in PA and BMI inequities, respectively, with 99.2% for PA and 97.5% for BMI explained by age, race/ethnicity, education, nativity, and occupation status. On average, Asian and Black women, those aged 35-49 years, those born outside the US, and those with less than a high school diploma had the lowest predicted mean PA. For BMI, Black and Hispanic/Latino women and those younger than 64 years had the highest mean BMI. Conclusion PA and BMI inequities are mostly explained by race/ethnicity, age, education, nativity, and work status. Our findings offer insights into universal and potential policy-informed health promotion strategies that may be tailored to women with these social identities and lived experiences that have shaped physical activity and body mass index inequities.

5
The pitfalls of incidence-based time series regression for inferring the effects of weather on infectious diseases

Gemo, P.; Barrero Guevara, L. A.; Kussmaul, C.; Kramer, S. C.; Domenech de Celles, M.

2026-03-15 epidemiology 10.64898/2026.03.13.26348326 medRxiv
Top 0.1%
22.2%

A central question in environmental epidemiology is how the weather affects infectious diseases. Time-series regression (TSR) on population-level case incidence data is widely used to estimate weather effects; however, this design may be biased due to the complexities of infectious disease dynamics, including nonlinear feedback, various types of noise, and latent, dynamic variables such as population immunity. Here, we assess the reliability of incidence-based TSR through a controlled simulation study across four different climates and fifty scenarios representing different pathogens. For each scenario, we simulated 10 years of weekly incidence data using a simple transmission model that included real-world weather data on temperature and relative humidity. We then examined whether the ground-truth weather effects could be recovered from model simulations using negative binomial generalized additive models, a flexible class of TSR models commonly used in empirical applications. We find that these models frequently fail to yield accurate and precise estimates of weather effects, even under favorable conditions such as no process noise and low observation noise (overdispersion). Hence, our results caution against the indiscriminate use of TSR models and suggest that more mechanistic approaches are needed for statistical inference of weather effects from population data.

6
Methodological Considerations in Sibling Analyses of Prenatal Acetaminophen

Ahlqvist, V. H.; Sjoqvist, H.; Gardner, R. M.; Lee, B. K.

2026-03-30 epidemiology 10.64898/2026.03.27.26349515 medRxiv
Top 0.1%
18.9%

Background: Sibling-matched designs control for shared familial confounding but remain vulnerable to non-shared confounders. Bi-directional sensitivity analyses, which stratify families by whether the older or younger sibling was exposed, are commonly used to assess carryover effects. We aimed to demonstrate how this methodological approach can introduce severe confounding by parity. Methods: We conducted simulations motivated by a recent epidemiological study. The true causal effect of a hypothetical exposure (prenatal acetaminophen) on neurodevelopmental outcomes was set to strictly null. To introduce parity-related confounding, baseline exposure and outcome probabilities were varied slightly by birth order. We compared conditional logistic regression effect estimates from total sibling models against bi-directional stratified models. Results: In the total simulated sibling cohort, models yielded the true null effect (odds ratio = 1.00) when adjusting for parity. However, the bi-directional analyses exhibited divergent artifactual signals. Because parity is perfectly collinear with exposure in these stratified subsets, it cannot be adjusted for. For example, when the older sibling was exposed, the odds ratio for autism spectrum disorder was 1.68; when the younger was exposed, the odds ratio was 0.60. Conclusions: Divergent estimates in bi-directional sibling analyses can be a predictable artifact of parity confounding rather than evidence of carryover effects or invalidating unmeasured bias. Overall sibling models adjusting for parity may remain robust despite divergent stratified sensitivity results.

7
Constructing and analyzing a synthetic life course cohort based on pooling two data sources: A case study of early adulthood depression symptomatology and late-life cognition

Zimmerman, S. C.; Buto, P.; Kezios, K.; Zeki Al Hazzouri, A.; Glymour, M. M.

2026-02-27 epidemiology 10.64898/2026.02.25.26347113 medRxiv
Top 0.1%
18.9%

Background: Synthetic cohorts created by combining two cohorts can be useful when no single data set includes both the exposure and outcome data of interest. We estimate the effects of depression in early adulthood on later-life memory outcomes using two nationally representative cohorts separately and in a synthetic sample. Methods: We used the National Longitudinal Study of Youth 1979 (NLSY; N=5,747) and the Health and Retirement Study (HRS; N=6,846) and a synthetic cohort combining exposure data from N=5,680 NLSY participants (born 1957-1965) aged 55-63 in 2020 who completed midlife cognitive assessment between 2006 and 2020 with outcome data from N=9,726 HRS participants born 1957-1964 who completed cognitive assessments when 47-63 years old and every 2 years thereafter. A 6-item version of the Centers for Epidemiologic Studies-Depression (CES-D) score (range 0-6) was measured from late adolescence through midlife in NLSY and in midlife in HRS. Memory was measured as the sum of immediate and delayed word recall scores up to twice in NLSY at age 48+ and up to 10 times in HRS at age 50+. We generated a synthetic life course cohort, matching HRS participants to NLSY participants based on 10 variables measured in midlife in both cohorts and posited to either confound or mediate the association between early life depressive symptoms and late-life memory. Matching variables included midlife depression and memory. We used confounder-adjusted linear mixed models to estimate the association between earliest reported depressive symptoms in NLSY and HRS with memory in the respective data sets and evaluated associations of early life depression symptoms with the repeated later life memory measures in the synthetic cohort. Results: In NLSY, each increment in CES-D at age 23-31 was associated with lower average memory scores (β_NLSY_level = -0.050, 95% CI (-0.097, -0.003)) in midlife but no detectable difference in rate of memory decline (β_NLSY_slope = -0.070, 95% CI (-0.382, 0.242)). In HRS, CES-D at average age 53 was associated with lower average memory (β_HRS_level = -0.163 (-0.199, -0.128)) but not rate of decline (β_HRS_slope = -0.021 (-0.062, 0.020)). In the synthetic cohort, CES-D at age 23-27 was associated with lower memory score at age 50+ (β_synth_level = -0.044, 95% CI (-0.085, -0.003)) but not associated with rate of cognitive decline (β_synth_slope = 0.005, 95% CI (-0.052, 0.062)). Conclusions: Depressive symptoms at ages 23-31 predicted mid- to late-life memory function but had no clear association with memory decline. Combining data across cohorts spanning separate, but overlapping, parts of the life course is a promising approach to overcome data limitations in life course research, but it requires careful implementation to ensure that assumptions are met and estimates are appropriately interpreted.

8
Novel Representations of Vaccine Protection Against Progression to Severe Disease Over Time

Dean, N.; Zarnitsyna, V.

2026-02-14 epidemiology 10.64898/2026.02.12.26346197 medRxiv
Top 0.1%
18.4%

Background: Vaccines can prevent severe disease by preventing infection or by reducing progression among those who become infected. Vaccine effectiveness against progression given infection is often used to quantify this second mechanism, but it conditions on infection, which is itself affected by vaccination. As a result, this estimand lacks a clear causal interpretation and may behave non-intuitively over time. Methods: We introduce a conceptual framework that models protection against infection and protection against progression as separate components that wane over time. Protection is represented using individual-level threshold-crossing times that depend on covariates and define a time-varying population susceptible to infection. Within this framework, we derive standard vaccine effectiveness estimands and propose two alternative decompositions of protection against severe disease: a progression-risk-weighted multiplicative decomposition and an additive decomposition based on absolute risk reduction. We illustrate their behavior using simulated examples. Results: The weighted multiplicative decomposition restores a causal interpretation for progression protection within the doomed principal stratum and avoids negative estimates. The additive decomposition provides a clear representation of the pathways over time. Conclusions: Explicitly modeling the infection-to-severe-disease pathway improves interpretation of vaccine effectiveness under waning immunity.
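
For orientation, the conventional multiplicative relationship between the standard estimands can be sketched as follows. Because the risk of severe disease equals the risk of infection times the risk of progression given infection, the corresponding relative risks multiply. This is the familiar unweighted identity, not the progression-risk-weighted or additive decompositions proposed in the preprint; the function name and the input values are illustrative assumptions.

```python
def ve_against_severe_disease(ve_infection, ve_progression):
    """Combine two vaccine effectiveness values (each expressed as 1 - relative risk).

    ve_infection:   effectiveness against infection
    ve_progression: effectiveness against progression, conditional on infection
    """
    rr_infection = 1.0 - ve_infection
    rr_progression = 1.0 - ve_progression
    # Relative risk of severe disease is the product of the two relative risks.
    return 1.0 - rr_infection * rr_progression


# Illustrative (assumed) inputs: 50% protection at each step.
example_ve = ve_against_severe_disease(0.5, 0.5)
print(f"VE against severe disease: {example_ve:.2f}")
```

The conditional term in this product is the quantity the abstract flags as lacking a clear causal interpretation, since the infected vaccinated and infected unvaccinated groups need not be comparable.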

9
Social factors and lifespan inequality: a four-way factorial analysis of U.S. lifespan

Caswell, H.

2026-03-12 public and global health 10.64898/2026.03.11.26348159 medRxiv
Top 0.1%
18.2%

Background: Lifespan inequality arises both from heterogeneity (e.g., in sex or race) and from unavoidable individual stochasticity. By treating a heterogeneous population as a mixture we can (and many have) partition variance in lifespan into a between-group component due to heterogeneity and a within-group component due to chance. Until now, such studies have treated factors singly. It is now possible to analyze multiple factors and their contributions to variance. Objective: This paper is the first to exploit the new analysis for multi-factor studies. Multi-factor data are painfully rare, but a remarkable study by Bergeron-Boucher et al. presented U.S. life tables under all 54 combinations of four factors (sex, marital status, education, race). Our objective is to quantify the contributions of these factors and their interactions to lifespan inequality. Methods: The population is treated as a mixture of 54 groups, with a mixture distribution either flat or proportional to population size of the different factor combinations. Components of the variance in remaining longevity, for starting ages from 30 to 85 years, are calculated using marginal mixture distributions. Results: Even accounting for four factors and their interactions, between-group heterogeneity accounts for only 7% (population-weighted mixing) to 10% (flat mixing) of lifespan variance. Education and its interactions make the largest contribution. Contributions of two-way, three-way, and four-way interactions are orders of magnitude smaller. This suggests new ways of displaying, summarizing, and interpreting inequality as measured in multi-factor studies. Contribution: Multi-factor studies can now be used to identify sources of variance in longevity and other demographic outcomes.
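
The within-group and between-group partition described above follows from the law of total variance for a mixture. A minimal sketch, assuming three hypothetical groups with made-up mixing weights, mean remaining lifespans, and variances (none of these numbers come from the preprint, and the multi-factor interaction analysis is not reproduced here):

```python
def decompose_variance(weights, means, variances):
    """Split the variance of a mixture into within- and between-group parts.

    weights:   mixing weights (normalised here, so any positive scale works)
    means:     group means
    variances: group variances
    """
    total_weight = sum(weights)
    pi = [w / total_weight for w in weights]
    grand_mean = sum(p * m for p, m in zip(pi, means))
    # Within-group: weighted average of the group variances (chance).
    within = sum(p * v for p, v in zip(pi, variances))
    # Between-group: weighted variance of the group means (heterogeneity).
    between = sum(p * (m - grand_mean) ** 2 for p, m in zip(pi, means))
    return within, between


# Hypothetical inputs for illustration only.
weights = [0.5, 0.3, 0.2]
means = [50.0, 45.0, 40.0]
variances = [150.0, 160.0, 170.0]

within, between = decompose_variance(weights, means, variances)
between_share = between / (within + between)
print(f"within = {within:.2f}, between = {between:.2f}, "
      f"between-group share = {between_share:.1%}")
```

Passing equal weights corresponds to the flat mixing mentioned in the abstract, and weights proportional to group size correspond to population-weighted mixing.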

10
Sexual risk behaviours following medical male circumcision: a matched pseudo-cohort analysis using population-based survey data

Mwakazanga, D. K.; Daka, V.; Gwasupika, J. K.; Dombola, A. K.; Kapungu, K. K.; Khondowe, S.; Chongwe, G. K.; Fwemba, I.; Ogundimu, E.

2026-04-13 epidemiology 10.64898/2026.04.11.26350676 medRxiv
Top 0.1%
15.0%

Medical male circumcision (MMC) is an established HIV prevention intervention, yet concerns persist that circumcised men may adopt higher-risk sexual behaviours following the procedure. Evidence from observational studies has been inconsistent, partly because many analyses do not adequately distinguish behaviours that occur before circumcision from those that occur afterward. This study assessed the association between MMC and subsequent sexual behaviours while demonstrating how population-based cross-sectional survey data can be adapted to address this temporal challenge. We analysed nationally representative data from the 2024 Zambia Demographic and Health Survey (ZDHS), including men aged 15 - 59 years who reported their circumcision status. Men who had undergone medical circumcision were compared with uncircumcised men using a matched pseudo-cohort framework that reconstructed temporal ordering based on age at circumcision. Propensity score overlap weighting was applied to improve comparability between circumcised and uncircumcised men, and odds ratios were estimated using logistic regression models incorporating overlap weights and accounting for the complex survey design. Sexual behaviour outcomes occurring after circumcision included condom non-use at last sexual intercourse, multiple sexual partners in the past 12 months, self-reported sexually transmitted infection (STI) symptoms, and composite measures of sexual risk behaviour. The analysis included 9,609 men, of whom 33.3% were medically circumcised. MMC was associated with lower odds of condom non-use at last sexual intercourse (adjusted odds ratio [aOR] = 0.75, 95% confidence interval [CI]: 0.67 - 0.85) and lower odds of reporting any sexual risk behaviour (aOR = 0.83, 95% CI: 0.72 - 0.95). No meaningful associations were observed between MMC and reporting multiple sexual partners, self-reported STI symptoms, or higher levels of composite sexual risk behaviour. 
In this population-based study, MMC was not associated with sexual risk compensation under routine programme conditions within the overlap population defined by the weighting scheme, supporting the behavioural safety of MMC and illustrating the value of explicitly addressing temporality when analysing behavioural outcomes using cross-sectional survey data.

11
Mapping the Dynamic Interplay of Mental Health and Weight Across Childhood: Data-Driven Explorations Using Causal Discovery

Larsen, T. E.; Lorca, M. H.; Ekstrom, C. T.; Vinding, R.; Bonnelykke, K.; Strandberg-Larsen, K.; Petersen, A. H.

2026-04-17 epidemiology 10.64898/2026.04.16.26350943 medRxiv
Top 0.1%
14.3%

Childhood weight development, especially overweight and obesity, has been associated with mental health, but their dynamic, causal relationships, and whether these differ by sex, remain unclear. We applied causal discovery to data from the Danish National Birth Cohort (n=67,593) spanning six periods from pregnancy to late adolescence and considering 67 variables related to child and parental weight, mental health, lifestyle, and socio-economic factors. We found no statistically significant difference between the causal graphs for boys and girls (P=0.079). The data-driven models found causal influence of childhood weight on subsequent weight status. Mental health pathways were exclusively within or across adjacent periods and centered on early adolescent stress. We examined the interplay between a subset of mental health variables, containing information on externalizing and internalizing problems, and weight, and found no direct causal pathway between the two processes. These findings suggest that observed links between weight and these mental health measures may be attributable to confounding. Our findings demonstrate the value of data-driven causal discovery in large cohort studies and how to test for differences in causal mechanisms across subgroups. Results are available in an interactive application, enabling future research to further explore the interplay between weight and mental health.

12
Associations of alcohol use in early and middle adulthood with mid- and late-life cognition - a synthetic cohort approach

Buto, P. T.; Zimmerman, S. C.; Kezios, K.; Zeki Al Hazzouri, A.; Glymour, M. M.

2026-03-04 epidemiology 10.64898/2026.02.27.26346914 medRxiv
Top 0.1%
12.8%

OBJECTIVE: Using two cohorts and synthetic datasets, we estimated effects of prospectively reported alcohol use on memory outcomes across middle age. METHODS: Data were from National Longitudinal Study of Youth 1979 (NLSY79, n=7,540, alcohol reports from ages 18-26), Health and Retirement Study (HRS age 50-56 at enrollment, n=13,090), and a synthetic cohort matching early life exposure information from 3,259 NLSY79 participants to later life memory information from 5,451 HRS participants. Covariate-adjusted linear mixed models regressed memory (word list recall) on alcohol use (none, light/moderate, heavy). RESULTS: In NLSY, we found no evidence that associations between light/moderate drinking in early adulthood and mid-life memory score significantly differed from associations of drinking abstention (β = -0.09 (95% CI: -0.30, 0.11)) or heavy drinking (β = -0.26 (-0.48, -0.04)) with memory score. In HRS, both abstaining from alcohol (β = -0.14 (-0.25, -0.02)) and heavy drinking (β = -0.25 (-0.42, -0.07)) were negatively associated with cognitive level. Results from the synthetic cohort mirrored NLSY, suggesting no significant association of either abstention (β = 0.13 (-0.10, 0.36)) or heavy drinking (β = 0.02 (-0.25, 0.28)) with mid-to-late life memory score. DISCUSSION: Alcohol consumption may not have an effect on memory until later life, though associations may be affected by residual confounding.

13
Methodological Guidance for Predictor Variable Selection for Adolescent Smoking Outcomes in Global Youth Tobacco Survey Using R and Python

Ng'ambi, W. F.; Zyambo, C.; Kazembe, L.

2026-02-17 epidemiology 10.64898/2026.02.14.26346305 medRxiv
Top 0.1%
12.3%

Background: The Global Youth Tobacco Survey (GYTS) is widely used to monitor tobacco use among adolescents worldwide. However, inconsistent analytical approaches, particularly in handling complex survey designs and predictor selection, limit comparability across countries, survey waves, and software platforms. Although much of the GYTS literature relies on proprietary tools such as SAS and SPSS, practical and transparent guidance on implementing reproducible, theory-informed analyses remains limited. A unified workflow that respects the survey's design while supporting cross-platform implementation is needed. Methods: We developed a reproducible, open-source workflow for analysing GYTS data using R and Python. In R, analyses were conducted using the survey package (svydesign and svyglm) with constrained stepwise selection via stepAIC. In Python, a custom constrained stepwise procedure was implemented using statsmodels generalized linear models. The workflow explicitly incorporates survey weights, stratification, and clustering; harmonises variables across countries; protects a priori demographic covariates; and ensures consistent treatment of categorical predictors. The approach is illustrated using data from Zambia (n = 2,959) and pooled data from Ghana, Mauritius, Seychelles, and Togo (n = 15,914). Predictor selection was guided by Social Cognitive Theory and evidence from systematic reviews. Results: The constrained selection framework consistently retained key demographic variables (age, sex, and grade) while allowing data-driven selection of modifiable predictors using the Akaike Information Criterion. When identical constraints were applied, the R and Python implementations selected identical models and produced nearly equivalent point estimates (adjusted odds ratio differences <0.01), although Python-based confidence intervals did not account for clustering. Of 18 candidate predictors across individual, social, media, and policy domains, 14 were retained. The strongest independent predictors included awareness of tobacco products (OR = 5.61, 95% CI: 4.65-6.78), peer smoking (OR = 4.57, 95% CI: 3.34-6.25), and exposure to tobacco marketing (OR = 2.34, 95% CI: 1.89-2.91). Conclusions: This study provides a generalisable, theory-informed framework for predictor selection in complex survey data using open-source tools. The workflow supports consistent analyses across countries, survey waves, and software platforms, and is transferable to other youth and adult population surveys. All code and harmonisation resources are openly available to support reproducibility and adaptation.

Plain-Language Summary
What we asked: Can we predict adolescent smoking using GYTS data in a way that is easy to follow and reproducible across software?
What we did: Built a single workflow that respects survey design (weights, strata, clusters) and selects predictors using four explicit criteria: theoretical grounding in Social Cognitive Theory, empirical support from prior studies, relevance for intervention, and cross-country validity. Core demographics (age, sex, grade, region) were protected as essential confounders, while other predictors were selected based on statistical fit. The workflow runs equivalently in R and Python.
Why it matters: Many GYTS studies use weights only and ignore clustering and stratification, which makes confidence intervals too narrow. More importantly, most analyses include variables arbitrarily or let software drop important confounders automatically. Our approach ensures theoretically meaningful, policy-relevant variables are retained, producing more reliable and actionable results for prevention programs.

14
Causal estimands and target trials for the effect of lag time to treatment of cancer patients

Goncalves, B. P.; Franco, E. L.

2026-04-08 epidemiology 10.64898/2026.04.07.26350338 medRxiv
Top 0.1%
12.2%

Timeliness of therapy initiation is a fundamental determinant of outcomes for many medical conditions, most importantly, cancer. Yet, existing inefficiencies in healthcare systems mean that delays between diagnosis and treatment frequently adversely affect the clinical outcome for cancer patients. Although estimates of effects of lag time to therapy would be informative to policymakers considering resource allocation to minimize delays in oncology, causal methods are seldom explicitly discussed in epidemiologic analyses of these lag times. Here, we propose causal estimands for such studies, and outline the protocol of a target trial that could be emulated with observational data on lag times. To illustrate the application of this approach, we simulate studies of lag time to treatment under two scenarios: one in which indication bias (Waiting Time Paradox) is present and another in which it is absent. Although our discussion focuses on oncologic outcomes, components of the proposed target trial could be adapted to study delays for other medical conditions. We believe that the clarity with which causal questions are posed under the target trial emulation framework would lead to improved quantification of the effects of lag times in oncology, and hence to better informed policy decisions.

15
Comparison of methods for assessing effects of risk factors on disease progression in Mendelian randomization under index event bias

Zhang, L.; Higgins, I. A.; Dai, Q.; Gkatzionis, A.; Quistrebert, J.; Bashir, N.; Dharmalingam, G.; Bhatnagar, P.; Gill, D.; Liu, Y.; Burgess, S.

2026-03-02 epidemiology 10.64898/2026.02.26.26347193 medRxiv
Top 0.1%
12.0%

Mendelian randomization has emerged as a transformative approach for inferring causal relationships between risk factors and disease outcomes. However, applying Mendelian randomization to disease progression - a critical step in validating pharmacological targets - is hampered by index event bias. This form of selection bias occurs because analyses of disease progression are necessarily restricted to individuals who have already experienced the disease event. Here, we present a comprehensive evaluation of statistical methods designed to mitigate index event bias, including inverse-probability weighting, Slope-Hunter, and multivariable methods. We compare the performance of these methods in simulations and applied examples. Inverse-probability weighting methods reduce bias, but require individual-level data and will only fully eliminate bias when the disease event model is correctly specified. Slope-Hunter performed poorly in all simulation scenarios, even when its assumptions were fully satisfied. Multivariable methods worked best when including genetic variants that affect the incident disease event. However, if these genetic variants also affect disease progression directly, then the analysis will suffer from pleiotropy. Hence, if the same biological mechanisms affect disease incidence and progression, then multivariable methods will have little utility. But in such a case, analyses of disease progression are less critical, as conclusions reached from analyses of disease incidence are likely to hold for disease progression. Our findings indicate that no single method is a universal solution to provide reliable results for the investigation of disease progression. Instead, we propose a strategic framework for method selection based on data availability and biological context.

16
The joint effects of exposure to prenatal pesticides and psychosocial factors on epigenetic age acceleration in the first 5 years of life in a South African birth cohort.

Abrishamcar, S.; Eick, S. M.; Everson, T.; Suglia, S. F.; Fallin, M. D.; Wright, R. O.; Andra, S. S.; Chovatiya, J.; Jagani, R.; Barr, D. B.; Lussier, A. A.; Dunn, E. C.; MacIsaac, J. L.; Dever, K.; Kobor, M. S.; Hoffman, N.; Koen, N.; Zar, H. J.; Stein, D. J.; Hüls, A.

2026-04-05 epidemiology 10.64898/2026.04.03.26350118 medRxiv
Top 0.1%
10.7%

Background. Prenatal exposure to pesticides and psychosocial factors often co-occurs, particularly in low- and middle-income settings, yet their joint effects on epigenetic age acceleration (EAA) in early life remain unknown. We investigated the joint associations of prenatal pesticide metabolites and psychosocial factors with EAA in the first five years of life in the South African Drakenstein Child Health Study. Methods. In 643 mothers, we measured 11 urinary pesticide metabolites and seven psychosocial factors during the second trimester of pregnancy. Child DNA methylation was measured in whole blood at ages 1, 3, and 5 years. EAA was estimated using the Horvath, Skin & Blood Horvath (skinHorvath), and Wu epigenetic clocks. Longitudinal associations were estimated using generalized estimating equations, adjusted for confounders. Joint mixture associations were evaluated using weighted quantile sum regression (WQS) and quantile g-computation (QGCOMP). Results. Using WQS, the joint prenatal exposure mixture was positively associated over time with Wu (β per one-quintile increase in the mixture [95% CI]: 0.41 years [0.15, 0.80]), skinHorvath (0.11 years [0.06, 0.16]), and Horvath EAA (0.31 years [0.20, 0.46]). Psychosocial factors, particularly food insecurity, physical interpersonal violence, and stress biomarkers, contributed most to the total mixture effect for all clocks. Pyrethroid metabolites PBA and TDCCA were top pesticide contributors to Wu EAA. Pathway enrichment analyses of clock-specific CpGs revealed distinct biological architectures, with the Wu clock enriched for neurodevelopmental and immune pathways, and the Horvath clock for metabolic pathways. Discussion. Joint prenatal exposure to pesticides and psychosocial factors was associated with increased EAA across early childhood, with psychosocial factors contributing the most to the total effect. These findings highlight the importance of assessing chemical and non-chemical stressors jointly and of clock-specific biological interpretation in epigenetic aging research.

17
GPS Mobility Tracking, Ecological Momentary Assessment, and Qualitative Interviewing to Specify How Space Produces Intersectional Health Inequities: Development and Pilot Testing of the Spatial Intersectionality Health Framework (SIHF) and IGEMA Methodology

Cook, S. H.

2026-04-13 epidemiology 10.64898/2026.04.09.26350546 medRxiv
Top 0.1%
10.2%

Background. Young sexual and gender minorities of color face compound health risks shaped by interlocking systems of racism, cisgenderism, and class inequality. Spatial health research documents that place shapes health, but existing methods cannot specify the mechanisms through which spatial configurations produce different health outcomes for differently positioned people. This gap prevents targeted intervention. Objective. To develop and pilot test the Spatial Intersectionality Health Framework (SIHF), which specifies three mechanisms through which space produces intersectional health inequities: Layered (multiple oppressive systems activating simultaneously), Positional (the same space producing different health pathways by intersectional position), and Conditional (nominally protective spaces carrying hidden costs for specific positions). We also introduce and validate Intersectional Geographically-Explicit Ecological Momentary Assessment (IGEMA) as the methodology operationalizing SIHF across three data levels. Methods. The GeoSense study enrolled 32 young sexual and gender minorities of color (ages 18-29) in New York City. IGEMA was implemented across three integrated levels: (1) GPS mobility tracking via participants' personal smartphones, linked to census tract structural exposure indices across n=19 participants; (2) ecological momentary assessment of intersectional discrimination with multilevel modeling of mood, stress, and sleep outcomes; and (3) map-guided qualitative interviews with SIHF mechanism coding and intercoder reliability assessment across 92 coded records from 18 participants. This study was conducted as the pilot for NIH R01HL169503. Results. All three SIHF mechanisms were empirically detectable. A compound structural gendered racism index outperformed every single-axis alternative in predicting daily mood (b=-0.048, p=.001) and stress (b=0.121, p<.001). The Positional mechanism accounted for 71% of coded harm experiences. Intercoder reliability for mechanism assignment reached kappa=0.824 at Stage 2 reconciliation. Daily intersectional discrimination predicted greater sleep disturbance (b=1.308, p=.004). Conclusions. SIHF and IGEMA together provide an empirically testable framework for specifying how space produces intersectional health inequities. Mechanism specification, not spatial location alone, is the condition for designing research and intervention that reaches the source of harm for multiply marginalized populations.

18
Transportability of missing data models across study sites for research synthesis

Thiesmeier, R.; Madley-Dowd, P.; Ahlqvist, V.; Orsini, N.

2026-03-10 epidemiology 10.64898/2026.03.09.26347913 medRxiv
Top 0.1%
10.2%

Introduction. Systematically missing covariates are a common challenge in medical research synthesis of quantitative data, particularly when individual participant data cannot be shared across study sites. Imputing covariate values in studies where they are systematically unobserved using information from sites where the covariate is observed implicitly assumes similarity of associations across studies. The behaviour of this assumption, and the bias arising from violating it, remains difficult to qualitatively reason about. Here, we evaluated a two-stage imputation approach for handling systematically missing covariates using simulations across a range of statistical and causal heterogeneity scenarios. Methods. We conducted a simulation study with varying degrees of between-study heterogeneity and systematic differences in model parameters. A binary confounder was set to systematically missing in half of the studies. Study-specific effect estimates were combined using a two-stage meta-analytic model. The performance of the imputation approach was evaluated with the primary estimand being the pooled conditional confounding-adjusted exposure effect across all studies. Results. Bias in the pooled adjusted effect estimate was small across scenarios with low to substantial between-study heterogeneity. Bias increased monotonically with increasingly pronounced differences in causal structures across study sites. Coverage remained close to the nominal level under low to substantial between-study heterogeneity, but deteriorated markedly as differences in causal structures between study sites became more severe. Conclusion. The two-stage cross-site imputation approach produced valid pooled effect estimates across a wide range of simulated scenarios but showed monotonic sensitivity to differences in causal structures across studies. The results provide insight into the conditions under which cross-site imputation may be appropriate for handling systematically missing covariates in research synthesis.
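As a rough illustration of the second stage of a two-stage meta-analysis, the sketch below pools study-specific estimates by inverse-variance weighting. The abstract does not say which pooling estimator the authors used, so the fixed-effect form here is an assumption, and the function name and input values are hypothetical.

```python
import math


def pool_fixed_effect(estimates, std_errors):
    """Inverse-variance (fixed-effect) pooling of study-specific estimates.

    Assumption: a fixed-effect model; the abstract does not specify the estimator.
    Returns the pooled estimate and its standard error.
    """
    if not estimates or len(estimates) != len(std_errors):
        raise ValueError("need one standard error per estimate")
    # Each study is weighted by the inverse of its sampling variance.
    weights = [1.0 / se ** 2 for se in std_errors]
    total_weight = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, estimates)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    return pooled, pooled_se


# Hypothetical study-specific adjusted effects (for example, log odds ratios).
estimates = [0.20, 0.40]
std_errors = [0.10, 0.10]
pooled, pooled_se = pool_fixed_effect(estimates, std_errors)
print(f"pooled estimate = {pooled:.3f}, SE = {pooled_se:.3f}")
```

A random-effects estimator would add a between-study variance term to each weight, which matters in the high-heterogeneity scenarios the abstract describes.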

19
Early life blood pressure, cognitive function and brain aging in mid-to-late life: A synthetic longitudinal cohort analysis

Bustillo, A. J.; Zeki Al Hazzouri, A.; Glymour, M. M.; Kezios, K.

2026-02-26 epidemiology 10.64898/2026.02.24.26346790 medRxiv
Top 0.1%
10.1%

PURPOSE. Over 6.9 million Americans above the age of 65 are living with Alzheimer's disease (AD) or related dementias (ADRDs), which are diseases characterized by cognitive decline and structural brain changes associated with accelerated brain aging. Cardiovascular risk factors, in particular hypertension, are well-studied risk factors for AD/ADRD. Evidence suggests that the effects of hypertension on cognitive aging may vary by life stage, yet prior studies have focused on the effects of mid- or late-life hypertension or blood pressure, leaving other life stages, including early life, unstudied. However, owing to the logistical complexity of follow-up throughout the life course, cognitive aging cohorts lack early-life blood pressure exposure data and cognitive and brain aging outcome data in mid/late life. When such data are unavailable from any single data source, data fusion methods may be employed to pool two compatible data sources to impute an early-life blood pressure exposure history and produce a synthetic longitudinal cohort in which the associations between early-life blood pressure and mid/late-life cognition and brain aging can be estimated. The purpose of this work is to estimate the association between early-life blood pressure and mid- and late-life cognition and brain aging in a synthetic longitudinal cohort. METHODS. We pooled the Bogalusa Heart Study (BHS) to provide early-life blood pressure data (ages 4-16) and the CARDIA study to provide mid/late-life cognition & brain aging outcome data (ages 58-70) to generate a synthetic longitudinal cohort. Cognition was defined as cognitive domain scores (including executive function, memory, processing speed, and language) calculated by Z-transforming cognitive test scores within each cohort. Global cognition was calculated as the average of these Z-scores. Brain aging was defined using the Spatial Patterns of Atrophy for Recognition of Brain Aging, a measure of age-related brain atrophy using T1-weighted MRI scans. The cohorts overlapped in ages 17-57 for potential matching variables including blood pressure, sociodemographics, and vascular risk factors. Cognition overlapped between ages 41-58. We pooled data by distance-matching many-to-one (BHS to CARDIA) on mediators & confounders of each exposure-disease relationship that overlapped in age of measurement between the two cohorts. These variables included intermediate values of the exposure (blood pressure, ages 17-57) and cognition (ages 41-58), in addition to sociodemographic and vascular risk factors. Linear regression models estimated the association between early-life blood pressure & cognitive & brain aging outcomes. RESULTS. BHS uniquely provided early-life blood pressure data (ages 4-16), while CARDIA provided cognitive & brain aging data at ages 58-70. Matching is feasible between the ages of 17-57 on blood pressure, sociodemographics, and vascular risk factors, but 41-57 for cognition. CONCLUSIONS. Our results demonstrate the feasibility & suitability of two US-based cardiovascular cohorts for generating a synthetic lifecourse cohort to estimate early-life blood pressure and its association with mid/late-life cognitive & brain aging outcomes. Future studies should aim to use measures that more closely overlap between both cohorts. Additionally, future studies should interrogate greater spans, such as early life through late life.
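The cognition scoring described in the methods (Z-transforming test scores within a cohort, then averaging across domains) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and scores are hypothetical, and the use of the sample standard deviation is an assumption the abstract does not confirm.

```python
import statistics


def z_scores(values):
    """Standardize raw scores within one cohort (assumes the sample SD)."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


def global_cognition(domain_scores):
    """Average the domain Z-scores per participant.

    domain_scores maps a domain name to a list of raw scores, with
    participants in the same order in every list.
    """
    z_by_domain = [z_scores(scores) for scores in domain_scores.values()]
    n_participants = len(z_by_domain[0])
    return [
        statistics.fmean([z[i] for z in z_by_domain])
        for i in range(n_participants)
    ]


# Hypothetical raw test scores for three participants in one cohort.
cohort_scores = {
    "memory": [10.0, 12.0, 14.0],
    "executive_function": [1.0, 2.0, 3.0],
}
print(global_cognition(cohort_scores))
```

Standardizing within each cohort puts tests with different scales on a common footing, at the cost of removing any true between-cohort difference in mean performance.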

20
Long COVID Prevalence among U.S. Adults: A State-level Ecological Analysis of the Contribution of COVID-19 Incidence, Severity of Acute Illness, COVID-19 Vaccination, and Chronic Conditions

Zhao, X.; Deng, L.; Ford, N. D.; Saydah, S.

2026-03-09 epidemiology 10.64898/2026.03.07.26347841 medRxiv
Top 0.1%
10.1%

Background. Long COVID has emerged as a major public-health concern in the United States, yet geographic variation in its prevalence remains poorly understood. This study examines how state-level differences in COVID-19 vaccination, SARS-CoV-2 incidence, COVID-19 hospitalization, and chronic disease burden relate to adult Long COVID prevalence in the United States. Methods. We conducted an ecological analysis using data from the 2023 Behavioral Risk Factor Surveillance System (BRFSS), from which we estimated state-level prevalence of self-reported Long COVID among adults. These estimates were linked with publicly available data on SARS-CoV-2 incidence, COVID-19 hospitalizations, COVID-19 vaccine coverage, and a multimorbidity indicator (>= 3 chronic conditions, e.g., diabetes, obesity, chronic kidney disease) associated with higher risk for severe SARS-CoV-2. Multivariable linear regression models were fitted to assess the contribution of each factor adjusted for age and sex distribution, incorporating Rubin's rules to account for uncertainty in prevalence estimates. Results. All examined factors (SARS-CoV-2 incidence, hospitalization rates, multimorbidity, and vaccine coverage) varied by state. When modeled simultaneously and adjusting for age and sex distribution, only COVID-19 vaccine coverage and SARS-CoV-2 incidence were significantly associated with Long COVID prevalence. COVID-19 vaccine coverage showed a strong protective association, while SARS-CoV-2 incidence showed a modest positive association. Multimorbidity and hospitalization rates were not independently associated with Long COVID prevalence after adjustment. Conclusions. State-level variation in Long COVID burden appears most strongly driven by COVID-19 vaccine coverage and SARS-CoV-2 incidence. Promoting COVID-19 vaccination remains essential to reduce long-term impacts from SARS-CoV-2 infections.
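For readers unfamiliar with Rubin's rules, the sketch below shows the standard way to combine a coefficient estimated on several replicate datasets. How the authors generated the replicates from the BRFSS prevalence estimates is not stated in the abstract, so the function name and inputs are hypothetical.

```python
import statistics


def rubins_rules(estimates, variances):
    """Combine m replicate estimates of one coefficient with Rubin's rules.

    estimates: point estimate from each replicate dataset.
    variances: squared standard error from each replicate dataset.
    Returns the pooled estimate and its total variance.
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two replicates, one variance each")
    pooled = statistics.fmean(estimates)
    within = statistics.fmean(variances)      # average within-replicate variance
    between = statistics.variance(estimates)  # between-replicate variance (m - 1 denominator)
    total = within + (1 + 1 / m) * between
    return pooled, total


# Hypothetical regression coefficients from three replicate datasets.
pooled, total_var = rubins_rules([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(f"pooled = {pooled:.3f}, total variance = {total_var:.3f}")
```

The between-replicate term is what carries the uncertainty in the state-level prevalence estimates into the standard errors of the regression coefficients.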